What is word sense disambiguation good for? 
Abstract

Word sense disambiguation has developed as a sub-area 
of natural language processing, as if, like parsing, it was 
a well-defined task which was a pre-requisite to a wide 
range of language-understanding applications. First, I 
review earlier work which shows that a set of senses for 
a word is only ever defined relative to a particular hu- 
man purpose, and that a view of word senses as part of 
the linguistic furniture lacks theoretical underpinnings. 
Then, I investigate whether and how word sense ambi- 
guity is in fact a problem for NLP applications. 

What word senses are not 

There is now a substantial literature on the problem of 
word sense disambiguation (WSD). The goal of WSD 
research is generally taken to be disambiguation be- 
tween the senses given in a dictionary, thesaurus or sim- 
ilar. The idea is simple enough and could be stated as 
follows: 

Many words have more than one meaning. When a 
person understands a sentence with an ambiguous 
word in it, that understanding is built on the basis 
of just one of the meanings. So, as some part of the 
human language understanding process, the appro- 
priate meaning has been chosen from the range of 
possibilities. 

Stated in this way, it would seem that WSD might be 
a well-defined task, undertaken by a particular mod- 
ule within the human language processor. This module 
could then be modelled computationally in a WSD pro- 
gram, and this program, performing, as it did, one of 
the essential functions of the human language processor, 
would stand alongside a parser as a crucial component 
of a broad range of NLP applications. 

There are problems with this view. The simplest 
stems from the observation that different dictionaries 
very often give different sets of senses for a word. A
closer investigation reveals a lack of theoretical foun- 
dations to the concept of 'word sense'. The concept is 
intimately connected to our knowledge and experience 
of dictionaries, but these are social artifacts created to 
satisfy such human purposes as playing word-games, re- 
solving family arguments, and making profits for pub- 
lishers. Amid all these competing goals, the pursuit of 
truth is not always dominant. 

In particular, a standard dictionary specifies the 
range of meaning of a word in a list, possibly nested, 
of senses. This is not the outcome of an analysis of 
how word-meaning operates, but is, rather, a response 
to constraints imposed by: 

• tradition 

• the printed page 

• compactness 

• a single, simple method of access 

• resolving disputes about what a word does and does 
not mean. 

The format of the dictionary has remained fairly sta- 
ble since Dr. Johnson's day. The reasons for the format, 
and the reasons it has proved so resistant to change and 
innovation, are explored at length in Nunberg (1994).
In short, the development of printed discourse, particu- 
larly the new periodicals, in England in the early part of 
the eighteenth century brought about a re-evaluation of 
the nature of meaning. No longer could it be assumed 
that a disagreement or confusion about a word's mean- 
ing could be settled face-to-face, and it seemed at the 
time that the new discourse would only be secure if 
there was some mutually acceptable authority on what 
words meant. The resolution to the crisis came in the 
form of Johnson's Dictionary. Thus, from its inception, 
the modern dictionary has had a crucial symbolic role
as in-principle arbiter of disputes. Hence "the
dictionary", with its implications of unique reference
and authority (cf. "the Bible"). Further evidence for
this position is to be found in McArthur (1987), for
whom the "religious or quasi-religious tinge" (p 38) to
reference materials is an enduring theme in their
history; Summers (1988), whose research into dictionary
use found that "settl[ing] family arguments" was a major
use (p 114, cited in Bejoint (1994, p 151)); and Moon
(1989), who catalogues the use of the UAD (Unidentified
Authorising Dictionary) from newspaper letters pages to
restaurant advertising materials (pp 60-64).

To solve disputes about meaning, a dictionary must 
be, above all, clear. It must draw a line around a mean- 
ing, so that a use can be classified as on one side of the 
line or the other. A dictionary which dwells on marginal 
or vague uses of a word, or which presents word meaning 
as context-dependent or variable or flexible, will be of 
little use for purposes of settling arguments. The pres- 
sure from this quarter is for the dictionary to present 
a set of discrete, non-overlapping meanings for a word, 
each defined by the necessary and sufficient conditions 
for its application — whatever the facts of the word's 
usage. 

Lexicographers are vividly aware of the problem. They
have frequently lamented the possibly-nested list model
(Stock, 1983; Hanks, 19??; Fillmore and Atkins, 1992).
They know all too well the injustice it frequently does
to a word's range of meaning and use. But WSD
researchers, at least until recently, have generally
proceeded as if this was not the case, as if a single
program (disambiguating, perhaps, in its
English-language version, between the senses given in
some hybrid descendant of Merriam-Webster, LDOCE,
COMLEX, Roget, OALDCE and WordNet) would be relevant to
a wide range of NLP applications.¹

The sets of word senses presented in different
dictionaries and thesauri have been prepared, for
various purposes, for various human users: there is no a
priori reason to believe those sets are appropriate for
any NLP application.²

It seems likely that NLP application lexicons, which
are, in the mid 1990s, almost invariably hand-built
rather than MRD-derived, will be application-driven
rather than resource-driven, so will only contain the
word senses and make the word sense distinctions
relevant to the application. They might not encounter
word sense ambiguity on anything like the scale that a
brief glance at a dictionary (or at the WSD literature)
would suggest. The remainder of the paper addresses
whether this is so, and what scale of problem word sense
ambiguity causes for different varieties of NLP
application.³

¹ The most promising recent WSD work is moving away
from this position, determining the senses between which
the program is to disambiguate either directly from the
clusters in the corpus (Schütze, 1997); or through a
small amount of human input (Clear, 1994), or a choice
of either (Yarowsky, 199?).

² For a full account of the nature of word senses, in
dictionaries and elsewhere, see Kilgarriff (1992; 1993;
1997b).

³ My sources include an informal email survey on the

Taxonomy

First, let us distinguish five types of application for
which WS ambiguity is potentially an issue:

• Information Retrieval (IR)

• Machine Translation (MT)

• Parsing (and, implicitly, all those applications for
which parsing is one stage of processing)

• Lexicography

• Residual, 'core' language understanding (including
database front ends, dialogue systems, Information
Extraction as in MUC), hereafter NLU.

IR

The intellectual affinities of most recent WSD work are
with IR. The problem of finding whether a particular
sense applies to an instance of a word can be construed
as equivalent to the essential IR task of finding
whether a document is relevant to a query. The homology
is made explicit at various points in the literature
(Gale, Church, and Yarowsky, 1992; Gale, Church, and
Yarowsky, 1993).

Most work in IR disregards syntactic structure
entirely, 'stemming' words so that clean, cleaner,
cleaning and cleaned are all mapped to clean, and then
treats a document as a bag of stems. It does not use
POS-tagging or name-recognition, although these are
relatively mature and reliable technologies for these
tasks within NLP, and parsing has not been found to
improve IR performance: the linguistic processing has
not been fast, robust or portable enough, and it is not
in any case clear whether it provides relevant
information for the IR task. This is very much a live
issue: see Strzalkowski (1994), Strzalkowski and Vauthey
(1995) for recent evidence of the potential of NLP in
IR. However, to date, IR has made progress through
applying sophisticated statistical techniques to
documents treated as objects without linguistic
structure, and this is the approach to WSD which has
recently flourished.

Within IR, WSD can be viewed as an alternative to NLP,
rather than a technique within it. If a statistical
model based on a bag of stems is inadequate, one way to
get closer to the meaning of a text is WSD; another is a
linguistically-informed technique such as parsing. They
are not mutually exclusive, but nor are they readily
compatible.
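
The bag-of-stems representation described above can be made concrete with a minimal sketch. The suffix list and the function names `crude_stem` and `bag_of_stems` are assumptions introduced for illustration; they do not reproduce the stemmer of any particular IR system, which would typically be more elaborate.

```python
import re
from collections import Counter

# Illustrative suffix list only (an assumption); real IR systems use a
# fuller stemming algorithm, which is not reproduced here.
SUFFIXES = ("ing", "ed", "er", "s")


def crude_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_stems(text):
    """Lower-case, tokenise on letters, stem and count; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


print(bag_of_stems("Cleaning the cleaner cleaned it clean"))
```

All four forms of clean collapse to one stem, and nothing about syntactic structure survives in the resulting counts, which is the property the text relies on.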

A high proportion of WSD research is oriented towards
IR, yet it is not clear whether WSD has the potential to
significantly improve IR performance. In the first
careful study of the question, Krovetz and Croft (1992)
conducted some experiments which suggested
that WS-ambiguity causes only limited degradation of 
IR performance. Their experiments were on the small, 
specialist CACM corpus. They used a standard set 
of queries for which "correct answers" are available. 
They compared system performance 'with ambiguity' 
and 'without ambiguity': the 'with ambiguity' condi- 
tion was the normal situation, while for the 'without 
ambiguity' condition, all relevant terms had been man- 
ually disambiguated, in a simulation of a perfect WSD 
program. For this corpus and query-set, they concluded 
that a perfect WSD program would improve perfor- 
mance by 2%. 

Sanderson (1994) performed a similar experiment using
pseudo-words. A pseudo-word is a word formed by
'pretending' that two distinct words were a single word
with two meanings, one corresponding to each of the
original words. Thus the pseudo-word banana-kalashnikov
could be formed by replacing all instances of banana and
kalashnikov in a corpus by banana-kalashnikov: then a
WSD program would have the task
of determining which were originally bananas, which 
kalashnikovs. The method allowed Sanderson to reg- 
ulate the degree of ambiguity in the corpus, and to 
model both accurate and inaccurate WSD programs. 
He found that introducing extra ambiguity did little 
to degrade performance, but, when the WSD algorithm 
made mistakes, this did do harm. Also, in longer queries 
the different words in the query will tend to be mutu- 
ally disambiguating, so WSD is probably only relevant 
where the query is very short. He concludes "the per- 
formance of [IR] systems is insensitive to ambiguity but 
very sensitive to erroneous disambiguation" (p 149). 
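
The pseudo-word construction can be sketched as follows. This is an illustration under stated assumptions rather than Sanderson's implementation: the function names are hypothetical, and the simulated disambiguator simply errs independently at each occurrence with a fixed probability, which is one simple way of modelling an inaccurate WSD program.

```python
import random


def make_pseudo_word_corpus(tokens, word_a, word_b):
    """Replace every occurrence of word_a or word_b with the pseudo-word.

    Returns the ambiguous token list and a gold record mapping each
    replaced position to the word that originally stood there.
    """
    pseudo = f"{word_a}-{word_b}"
    ambiguous, gold = [], {}
    for position, token in enumerate(tokens):
        if token in (word_a, word_b):
            ambiguous.append(pseudo)
            gold[position] = token
        else:
            ambiguous.append(token)
    return ambiguous, gold


def simulate_wsd(gold, word_a, word_b, accuracy, seed=0):
    """Hypothetical disambiguator: right with probability `accuracy` (an assumed error model)."""
    rng = random.Random(seed)
    guesses = {}
    for position, truth in gold.items():
        other = word_b if truth == word_a else word_a
        guesses[position] = truth if rng.random() < accuracy else other
    return guesses


tokens = "he ate a banana and cleaned the kalashnikov".split()
ambiguous, gold = make_pseudo_word_corpus(tokens, "banana", "kalashnikov")
print(ambiguous)
print(simulate_wsd(gold, "banana", "kalashnikov", accuracy=0.75))
```

Because the gold record is known by construction, both the amount of ambiguity and the accuracy of the disambiguator can be varied independently of each other.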

Schütze (1997) first distinguishes sense discrimination
from disambiguation. Discrimination involves identifying
distinct senses and classifying occurrences of the word
as belonging to one of those senses. It does not involve
labelling the senses (which correspond to clusters of
occurrences) or associating them with any external
knowledge source such as a dictionary. Thus, in keeping
with the spirit of this paper, his senses are
automatically devised to match the corpus. System
performance improved by up to 4.3%⁴ with the addition of
the disambiguation module (and the added sophistication
that a word can be assigned to more than one word sense,
where it is 'near' more than one in vector space).

It is debatable how important an improvement of 2 
or 4 percentage points is. On the one hand, WSD will 
clearly not revolutionise IR or render it a solved prob- 
lem. But IR is a fairly mature technology, very widely 
used by millions of users, and an average 4% improve- 
ment across all those users and all their many queries 
could be seen as very significant indeed. 
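
Improvement figures of this kind can be quoted either in absolute percentage points or relative to the baseline, and the two differ considerably. A small worked example, using the average-precision figures cited for Schütze's system (29.9% rising to 34.2%); the function name `improvement` is an assumption for illustration.

```python
def improvement(baseline, improved):
    """Return (absolute gain in percentage points, gain relative to baseline in %)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


absolute, relative = improvement(29.9, 34.2)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 4.3 points, relative: 14.4%"
```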

Machine Translation 

In IR, it is generally difficult to assign blame for poor 
performance to word sense ambiguity or any other spe- 
cific source. MT, by contrast, wears its mistakes on its 
sleeve. It is abundantly clear to all in MT that word 
sense ambiguity is a huge problem. 

The literature has surprisingly little to say about it. 
Hutchins and Somers (1992) point out the two variants
of the problem: monolingual ambiguity, where the word 
is ambiguous in the source language, and translational 
ambiguity, where speakers of the source language do 
not consider the word ambiguous but it has two pos- 
sible translations, as when English blue is translated 
differently into Russian according to whether it is light 
blue or dark. 

MT is a technology rather than a science. MT sys- 
tems generally take a decade from idea to marketplace, 
so the theory available at their inception is destined to 
be out of date by the time they perform. Thus no re- 
cent WSD work is employed in existing MT systems. 
They use extensive sets of selection restrictions paired 
with semantic features to make it possible for the sys- 
tem to make the correct lexical choice. MT systems 
usually use a number of very large lexicons where selec- 
tion restriction information, designed to resolve ambi- 
guity problems, accounts for a large proportion of the 
bulk. The SYSTRAN English-French lexicon respon- 
sible for word choice contains 400 rules governing the 
one English word, oil, and when it should be translated 
as huile, when pétrole (Hutchins and Somers, 1992, p
179). 
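
To give a flavour of how selection restrictions paired with semantic features drive lexical choice, here is a minimal sketch. The rules, the feature names and the first-match-wins policy are all invented for illustration; they are not taken from SYSTRAN or from any real MT lexicon.

```python
from typing import NamedTuple, Set


class Rule(NamedTuple):
    required_features: Set[str]  # semantic features that must hold in the context
    translation: str


# Hypothetical rules for English "oil"; a real lexicon would hold
# hundreds of far more specific rules.
OIL_RULES = [
    Rule({"cooking"}, "huile"),
    Rule({"food"}, "huile"),
    Rule({"drilling"}, "pétrole"),
    Rule({"fuel"}, "pétrole"),
]


def translate(rules, context_features, default=None):
    """Return the translation of the first rule whose features all occur in the context."""
    for rule in rules:
        if rule.required_features <= context_features:
            return rule.translation
    return default


print(translate(OIL_RULES, {"cooking", "liquid"}))  # huile
print(translate(OIL_RULES, {"drilling"}))           # pétrole
```

Every distinction the rules draw is one the translation needs, so the inventory of 'senses' is fixed by the target language rather than by a monolingual dictionary.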

⁴ The figures cited for Schütze (1997) are an improved
average precision (over 11 levels of recall) of 14.4%
compared to the baseline, from 29.9% to 34.2%. This
improvement is 4.3% in absolute terms, but 14.4% when
calculated as an improvement on the baseline
performance.

One paper which does bring state-of-the-art WSD to bear
on Machine Translation, albeit in experimental mode, is
Dagan and Itai (1994). They use a bilingual lexicon to
identify the possible translations, and a parsed target
language corpus to gather information about the 'tuples'
in which each of the possible translations is often
found. A 'tuple' comprises a grammatical relation, such
as subject-verb, and the occupier of each of the slots
of that relation, so "The man walked home" would give
the triple (subject-verb, man, walk). The
source-language text to be translated is then parsed, to
give a source language tuple. The bilingual dictionary
and the target-language statistics are then used to find
the best match.

The paper applies sophisticated WSD to a real prob- 
lem, with the discriminations that the system makes 
being defined by the needs of the application. 
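
The tuple-matching idea can be sketched as below. This is a simplification under stated assumptions: the source words `w1` and `w2`, the lexicon entries and the counts are invented toy data, and the sketch simply picks the candidate target tuple with the highest raw count, whereas the statistical decision criterion actually used by Dagan and Itai is not reproduced here.

```python
from itertools import product


def best_translation(source_tuple, lexicon, target_counts):
    """Pick the candidate target-language tuple with the highest corpus count."""
    relation, *source_words = source_tuple
    candidates = product(*(lexicon[word] for word in source_words))
    return max(
        ((relation, *words) for words in candidates),
        key=lambda candidate: target_counts.get(candidate, 0),
    )


# Hypothetical bilingual lexicon: each source word has several translations.
lexicon = {"w1": ["sign", "seal"], "w2": ["treaty", "contract"]}

# Hypothetical counts of tuples observed in a parsed target-language corpus.
target_counts = {
    ("verb-object", "sign", "treaty"): 12,
    ("verb-object", "sign", "contract"): 7,
    ("verb-object", "seal", "treaty"): 1,
}

print(best_translation(("verb-object", "w1", "w2"), lexicon, target_counts))
```

Note that only a monolingual target-language corpus is consulted for the statistics; the bilingual resource is the lexicon alone.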

Parsing 

Accurate parsing is a requirement for a wide range of
NLP applications, so if WSD is critical for parsing
accurately, it is, by implication, significant for all
those applications that depend on parsing. McCarthy
(1997) explores WSD methods explicitly for purposes of
improving parsing. Before assessing whether WS
ambiguity is critical, let us take a step back.

It is well-established that "the problem of syntactic
ambiguity is AI-complete" (Hobbs et al., 1992, p 269).
Here, let us focus on one particular, but pervasive,
variety of syntactic ambiguity: prepositional phrase
(PP) attachment. A problem is AI-complete if its
solution requires a solution to all the general AI
problems of representing and reasoning about arbitrary
real-world knowledge. In principle, any item of general
knowledge might be the datum required to make a
PP-attachment. If that is all that can be said, the
outlook is bleak. We would hope that, in practice, a
small and tractable subset of general knowledge will
resolve a high proportion of ambiguities.

Some approaches to high-quality parsing make extensive
use of machine-readable dictionaries (MRDs). In the
1990s, Microsoft have been the leading proponents of
'MRDs-for-parsing'.⁵ The hypothesis behind the approach
is that dictionary entries provide, implicitly or
explicitly, the information required to resolve most
syntactic ambiguities.

⁵ The method is used in the parser embedded in 1997
Microsoft Word's grammar checker, as demonstrated by
Steve Richardson at the ACL Conference in Applied NLP,
Washington D.C., 1997.

Note that, even if this hypothesis is true, it does not
imply that WSD has an important role to play. Lexical
information can resolve many syntactic ambiguities
without being sense-disambiguated. Consider

1 I love baking cakes with friends.

2 I love baking cakes with butter icing.

The PP attachment ambiguity is resolved, along with the
ambiguity of with, by the semantic class of the final
noun phrase. Where the head of this noun phrase is
human, as in 1, the PP attaches to the verb. Where it is
a cake ingredient, it attaches to cakes. Lexical
information is required to determine the attachment in 1
and 2, but, since neither friends nor icing is ambiguous
between humans and cake-ingredients, disambiguation is
not required.
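
The attachment decision for examples 1 and 2 can be written out as a minimal sketch. The semantic-class table and the function name `attach_with_pp` are hypothetical; in practice the classes would come from a lexicon or machine-readable dictionary.

```python
# Hypothetical semantic classes for head nouns. Each noun has exactly one
# class here, so no sense disambiguation is involved.
SEMANTIC_CLASS = {
    "friends": "human",
    "children": "human",
    "icing": "cake_ingredient",
    "butter": "cake_ingredient",
}


def attach_with_pp(head_noun):
    """Decide where 'with <NP>' attaches in 'baking cakes with <NP>'."""
    semantic_class = SEMANTIC_CLASS.get(head_noun)
    if semantic_class == "human":
        return "verb"  # accompaniment: baking ... with friends
    if semantic_class == "cake_ingredient":
        return "noun"  # part of the cake: cakes with butter icing
    return "unknown"   # the lexical information available does not decide


print(attach_with_pp("friends"))  # verb
print(attach_with_pp("icing"))    # noun
```

The lookup is purely lexical: a single class per noun suffices, which is the sense in which lexical information can resolve the ambiguity without any WSD step.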

That lexical information will resolve a high propor- 
tion of syntactic ambiguities is one hypothesis; that a 
significantly higher proportion will be resolved, if the 
lexical information is sense-specific, is another. 

Almost no work has been done to test either hypothesis.
Whittemore, Ferrara, and Brunner (1990) tested
and confirmed a related hypothesis: that 'lexical pref- 
erences' of nouns and verbs for PPs of a particular 
type are better predictors of PP-attachment than any 
purely syntactic considerations. They took a sample 
corpus and counted the PPs that would be correctly 
attached if each strategy was used. To discover the 
significance of WS-ambiguity to parsing, a study is re- 
quired which combines this method with Krovetz and 
Croft's, of manually disambiguating to determine the 
performance improvement that would be achieved with 
a perfect WSD program. 

Lexicography 

NLP is most aware of lexicographers as suppliers of 
wares, but they are also customers. A linguistically an- 
notated corpus is of more use to a lexicographer than 
a 'raw' one, as he or she can then investigate the be- 
haviour of a word in particular linguistic contexts with- 
out having to trawl through large numbers of irrelevant 
citations. A sense-annotated corpus would be partic- 
ularly valuable, as the lexicographer would not have 
to trawl through 'money bank' citations when defining
'river bank' (Clear, 1994). There is then an intriguing
possibility that the behaviour of WSD programs will 
feed back into the nature of the dictionary senses they 
disambiguate between. 

NLU 

For existing NLP applications requiring a deeper under- 
standing of the text, 99% of the ambiguity to be found 
in a desk dictionary is not relevant. This is, firstly, 
because these applications deal only with very specific 
text types. The specific sublanguage generally means 
that, if a word has a meaning which is of interest, it is 
very likely that occurrences of the word will be used
in that meaning and not some other. Secondly, even then
the application can only interpret those inputs for
which there is a possible interpretation in the
knowledge base (or in the system's output behaviour).
Several respondents to the email survey, where I asked,
"does WS ambiguity cause problems for your system?",
commented "We don't have any semantics in our lexicon,
we just have hooks into the knowledge representation".

Where a word has one sense in the domain model, 
and one or more outside it, an NLU application can 
generally determine whether the word is being used in 
the domain sense by identifying whether the entire sen- 
tence or query is coherent in terms of the domain model. 
If it is, the word is almost certainly being used in the 
domain sense. Where a word has more than one do- 
main sense, it is unlikely that both will produce coher- 
ent analyses. The domain model will generally provide 
disambiguating material, not because it has been explic- 
itly added, but because type-checking and coherence- 
checking which is necessary in any case will reject in- 
valid senses. 
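
How type-checking against a domain model rejects invalid senses as a side effect can be illustrated with a toy sketch. The predicates, types and lexicon 'hooks' below are invented for illustration and do not describe any particular NLU system.

```python
# Hypothetical toy domain model: each predicate names the type its
# argument must have.
PREDICATE_ARG_TYPE = {"drive": "vehicle", "feed": "animal"}

# Hypothetical lexicon 'hooks': each word points at one or more typed
# concepts in the domain model.
WORD_SENSES = {
    "jaguar": [("jaguar_car", "vehicle"), ("jaguar_cat", "animal")],
    "truck": [("truck", "vehicle")],
}


def coherent_senses(predicate, word):
    """Return the senses of `word` that type-check as the argument of `predicate`."""
    required = PREDICATE_ARG_TYPE.get(predicate)
    return [
        sense
        for sense, sense_type in WORD_SENSES.get(word, [])
        if sense_type == required
    ]


print(coherent_senses("drive", "jaguar"))  # ['jaguar_car']
print(coherent_senses("feed", "jaguar"))   # ['jaguar_cat']
print(coherent_senses("feed", "truck"))    # []
```

Nothing in the sketch was added for the purpose of disambiguation: the type signatures are needed to interpret the input at all, and the incoherent sense simply fails the check.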

With time, NLU systems will become more sophisticated,
with richer domain models and fewer limitations in the
varieties of text they can analyse. This will make WSD
more salient, though different strategies will be
relevant for the 'foreground lexicon', containing the
key words for the domain model, and the 'background
lexicon', containing all other words. Foreground lexicon
senses will be tightly-defined and domain-specific, and
will be disambiguated by coherence-checking. Background
lexicon disambiguation will only need to be between
coarse-grained senses. Its function will be to increase
parse accuracy, and statistical methods will be
appropriate. (The full argument is presented in
Kilgarriff (1997a).)



Answers 

The answers to the question, "Does WS ambiguity 
cause problems for NLP applications?" are: 

IR: yes, to some moderate degree. Problems can 
substantially be overcome by using longer queries. 
Within IR, WSD features as something of an alter- 
native to NLP. 

MT: yes. Huge problem, with the problem space de- 
fined by all the one-to-many and many-to-many map- 
pings in a bilingual dictionary. Addressed to date by 
lots and lots of selection restrictions. 

Parsing: not known. 

Lexicography: yes, WSD would be of benefit. 

NLU: not much. NLU applications are mostly
domain-specific, and have some sort of domain model. It
is generally necessary to have a detailed knowledge 
of the word senses that are in the domain, so the 
knowledge to disambiguate will often be available in 
the domain model even where it has not explicitly 
been added for disambiguation purposes. 